XP-002257548 



Kernel Services

The Solaris kernel manages system resources and provides facilities to user
processes in the form of kernel services. We begin by discussing the boundary
between user programs and kernel mode, then discuss the mechanisms used to enter
the kernel, including system calls, interrupts, and traps.

2.1 Access to Kernel Services



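The Solaris kernel insulates processes from kernel data structures and hardware
by using two distinct processor execution modes: nonprivileged mode and
privileged mode. Privileged mode is often referred to as kernel mode;
nonprivileged mode is referred to as user mode.

In nonprivileged mode, a process can access only its own memory, whereas in
privileged mode, access is available to all of the kernel's data structures and
the underlying hardware. The kernel executes processes in nonprivileged mode to
prevent user processes from accessing data structures or hardware registers that
may affect other processes or the operating environment. Because only Solaris
kernel instructions can execute in privileged mode, the kernel can mediate
access to kernel data structures and hardware devices.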






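If a user process needs to access kernel system services, a thread within the
process transfers from user mode to kernel mode through a set of interfaces known
as system calls. A system call allows a thread in a user process to switch into
kernel mode to perform an OS-defined system service. For example, a read()
system call executes special machine code instructions to change the processor
into privileged mode in order to begin executing the system call's kernel
instructions. While in privileged mode, the kernel code performs the I/O on
behalf of the calling thread, then returns to nonprivileged user mode, after
which the user thread continues normal execution.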




Figure 2.1 Switching into Kernel Mode via System Calls 



2.2 Entering Kernel Mode 



In addition to entering through system calls, the system can enter kernel mode for
other reasons, such as in response to a device interrupt. A transfer of control
to the kernel is achieved in one of three ways:

• Through a system call 

• As the result of an interrupt 

• As the result of a processor trap 

We defined a system call as the mechanism by which a user process requests a
kernel service, for example, to read from a file. System calls are typically
initiated from user mode by either a trap instruction or a software interrupt,
depending on the microprocessor and platform. On SPARC-based platforms, system
calls are initiated by issuing a specific trap instruction in a C library stub.




An interrupt is a vectored transfer of control into the kernel, typically
initiated by a hardware device, for example, a disk controller signalling the
completion of an I/O. Interrupts can also be initiated from software. Hardware
interrupts typically occur asynchronously to the currently executing thread, and
they occur in interrupt context.

A trap is also a vectored transfer of control into the kernel, initiated by the
processor. The primary distinction between traps and interrupts is this: Traps
typically occur as a result of the current executing thread, for example, a
divide-by-zero error or a memory page fault; interrupts are asynchronous events,
that is, the source of the interrupt is something unrelated to the currently
executing thread. On SPARC processors, the distinction is somewhat blurred, since
a trap is also the mechanism used to initiate interrupt handlers.

2.2.1 Context 

2.2.1.1 Execution Context 

Threads in the kernel can execute in process, interrupt, or kernel context. 

• Process Context - In the process context, the kernel thread acts on behalf
  of the user process and has access to the process's user area (uarea) and
  process structures for resource accounting. The uarea (struct u) is a special
  area within the process that contains process information of interest to the
  kernel: typically, the process's open file list, process identification
  information, etc. For example, when a process executes a system call, a thread
  within the process transfers into kernel mode and then has access to the uarea
  of the process's data structures, so that it can pass arguments, update system
  time usage, etc.

• Interrupt Context - Interrupt threads execute in an interrupt context.
  They do not have access to the data structures of the process or thread they
  interrupted. Interrupts have their own stack and can access only kernel data
  structures.

• Kernel Context - Kernel management threads run in the kernel context.
  In kernel context, system management threads share the kernel's environment
  with each other. Kernel management threads typically cannot access
  process-related data. Examples of kernel management threads are the page
  scanner and the NFS server.

2.2.1.2 Virtual Memory Context

A virtual memory context is the set of virtual-to-physical address translations
that construct a memory environment. Each process has its own virtual memory
context. When execution is switched from one process to another during a
scheduling switch, the virtual memory context is switched to provide the new
process's virtual memory environment.

On Intel and older SPARC architectures, each process context has a portion of
the kernel's virtual memory mapped within it, so that a virtual memory context
switch is not required when a thread moves between user and kernel mode. [...]
address spaces.

2.2.2 Threads in Kernel and Interrupt Context 

In addition to providing kernel services through system calls, the kernel must also
perform system-related functions, such as responding to device I/O interrupts,
performing some routine memory management, or initiating scheduler functions to
switch execution from one kernel thread to another.

• Interrupt Handlers - Interrupts are directed to specific processors, and on
  reception, a processor stops executing the current thread, context-switches
  the thread out, and begins executing an interrupt handling routine. Kernel
  threads handle all but high-priority interrupts. Consequently, the kernel can
  minimize the amount of time spent holding critical resources, thus providing
  better scalability of interrupt code and lower overall interrupt response time.
  We discuss kernel interrupts in more detail in "Interrupts" on page 38.

• Kernel Management Threads - The Solaris kernel, just like a process,
  has several of its own threads of execution to carry out system management
  tasks (the memory page scanner and NFS server are examples). Solaris kernel
  management threads do not execute in a process's execution context.
  Rather, they execute in the kernel's execution context, sharing the kernel
  execution environment with each other. Solaris kernel management threads are
  scheduled in the system (SYS) scheduling class at a higher priority than most
  other threads on the system.

Figure 2.2 shows the entry paths into the kernel for processes, interrupts, and
threads.



[Figure 2.2 diagram: a user process in user mode crosses the system call
interface to reach a system call in kernel mode. Figure note: Interrupts are
lightweight and do most of their work by scheduling an interrupt thread.]

Figure 2.2 Process, Interrupt, and Kernel Threads



2.2.3 UltraSPARC I & II Traps 

The SPARC processor architecture uses traps as a unified mechanism to handle
system calls, processor exceptions, and interrupts. A SPARC trap is a procedure
call initiated by the microprocessor as a result of a synchronous processor
exception, an asynchronous processor exception, a software-initiated trap
instruction, or a device interrupt.

Upon receipt of a trap, the UltraSPARC I & II processor enters privileged mode
and transfers control to the instructions, starting at a predetermined location
in a trap table. The trap handler for the type of trap received is executed, and
once the handler has finished, control is returned to the interrupted thread. A
trap causes the hardware to do the following:

• Save certain processor state (program counters, condition code registers,
  trap type, etc.)

• Enter privileged execution mode

• Begin executing code in the corresponding trap table slot

When an UltraSPARC trap handler's processing is complete, it issues a SPARC DONE
or RETRY instruction to return to the interrupted thread.






The UltraSPARC I & II trap table is an in-memory table that contains the first
eight instructions for each type of trap. The trap table is located in memory at
the address stored in the trap table base address register (TBA), which is
initialized during boot. Solaris places the trap table at the base of the kernel
(known as kernelbase) in a locked-down, non-pageable page so that no
memory-related traps will occur during execution of instructions in the trap
table. (For a detailed kernel memory map, see Appendix B, "Kernel Virtual
Address Maps".)



2.2.3.1 UltraSPARC I & II Trap Types 



UltraSPARC I & II traps can be categorized into the following broad types:

• Processor resets - Power-on reset, machine resets, software-initiated
  resets

• Memory management exceptions - MMU page faults, page protection
  violations, memory errors, misaligned accesses, etc.

• Instruction exceptions - Attempts to execute privileged instructions from
  nonprivileged mode, illegal instructions, etc.

• Floating-point exceptions - Floating-point exceptions, floating-point
  mode instruction attempted when floating point unit disabled, etc.

• SPARC register management - Traps for SPARC register window spilling,
  filling, or cleaning

• Software-initiated traps - Traps initiated by the SPARC trap instruction
  (Tcc); primarily used for system call entry in Solaris.

Table 2-1 shows the UltraSPARC I & II trap types, as implemented in Solaris. 
Table 2-1 Solaris UltraSPARC I & II Traps 



Trap Definition                             Trap Type     Priority

Power-on reset                              001           0
Watchdog reset                              002           1
Externally initiated reset                  003           1
Software-initiated reset                    004           1
RED state exception                         005           1
Reserved                                    006...007     n/a
Instruction access exception                008           5
Instruction access MMU miss                 009           2
Instruction access error                    00A           3
Reserved                                    00B...00F     n/a
Illegal instruction                         010           7
Attempt to execute privileged instruction   011           6
Unimplemented load instruction              012           6
Unimplemented store instruction             013           6
Reserved                                    014...01F     n/a
Floating-point unit disabled                020           8
Floating-point exception ieee754            021           11
Floating-point exception - other            022           11
Tag overflow                                023           14
SPARC register window clean                 024...027     10
Division by zero                            028           15
Internal processor error                    029           4
Data access exception                       030           12
Data access MMU miss                        031           12
Data access error                           032           12
Data access protection                      033           12
Memory address not aligned                  034           10
Load double memory address not aligned      035           10
Store double memory address not aligned     036           10
Privileged action                           037           11
Load quad memory address not aligned        038           10
Store quad memory address not aligned       039           10
Reserved                                    03A...03F     n/a
Asynchronous data error                     040           2
Interrupt level n, where n=1...15           041...04F     32-n
Reserved                                    050...05F     n/a
Vectored interrupts                         060...07F     Int. Specific
SPARC register window overflows             080...0BF     9
SPARC register window underflows            0C0...0FF     9
Trap instructions Tcc                       100...17F     16
Reserved                                    180...1FF     n/a



2.2.3.2 UltraSPARC I & II Trap Priority Levels 



Each UltraSPARC I & II trap has an associated priority level. The processor's trap 
hardware uses the level to decide which trap takes precedence when more than 
one trap occurs on a processor at a given time. When two or more traps are pend- 
ing, the highest-priority trap is taken first (0 is the highest priority). 
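
The precedence rule can be illustrated with a short Python sketch. It is only a
model of the selection rule described above, not processor or kernel code; the
function names and the pending list are hypothetical. The priorities are taken
from Table 2-1, where interrupt level n has trap priority 32-n.

```python
def interrupt_trap_priority(level):
    """Trap priority of 'interrupt level n' from Table 2-1 (32 - n)."""
    if not 1 <= level <= 15:
        raise ValueError("interrupt level must be 1..15")
    return 32 - level


def next_trap(pending):
    """Pick the pending trap to take first.

    pending is a list of (name, priority) pairs. The numerically lowest
    priority value wins, because 0 is the highest priority.
    """
    if not pending:
        return None
    return min(pending, key=lambda trap: trap[1])


# Hypothetical set of traps pending at the same moment.
pending = [
    ("division by zero", 15),
    ("data access MMU miss", 12),
    ("interrupt level 10", interrupt_trap_priority(10)),
]
chosen = next_trap(pending)
```

Here the data access MMU miss (priority 12) is chosen ahead of the division by
zero (15) and the level 10 interrupt (32 - 10 = 22).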

Interrupt traps are subject to trap priority precedence. In addition, interrupt 
traps are compared against the processor interrupt level (PIL). The UltraSPARC I 
& II processor will only take an interrupt trap that has an interrupt request level 






greater than that stored in the processor's PIL register. We discuss this behavior in 
more detail in "Interrupts" on page 38. 

2.2.3.3 UltraSPARC I & II Trap Levels 

The UltraSPARC I & II processor introduced nested traps; that is, a trap can be
received while another trap is being handled. Prior SPARC implementations could
not handle nested traps (a "watchdog reset" occurs on pre-UltraSPARC processors
if a trap occurs while the processor is executing a trap handler). Also introduced
was the notion of trap levels to describe the level of trap nesting. The nested traps
have five levels, starting at trap level 0 (normal execution, no trap) through trap
level 4 (trap level 4 is actually an error handling state and should not be reached
during normal processing).

When an UltraSPARC I & II trap occurs, the CPU increments the trap level
(TL). The most recent processor state is saved on the trap stack, and the trap
handler is entered. On exit from the handler, the trap level is decremented.

UltraSPARC I & II also implements an alternate set of global registers for each 
trap level. Those registers remove most of the overhead associated with saving 
state, making it very efficient to move between trap levels. 
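
As a rough illustration of trap nesting, the following Python sketch models the
trap level counter and the trap stack. It is a toy model under stated
assumptions: the class, its method names, and the contents of the saved state
record are hypothetical, and the alternate global registers are not modeled.

```python
MAXTL = 4  # trap level 4 is the error-handling state described above


class TrapLevelModel:
    """Toy model of nested traps: a trap level counter plus a trap stack."""

    def __init__(self):
        self.tl = 0           # trap level 0: normal execution, no trap
        self.trap_stack = []  # one saved-state record per nesting level

    def take_trap(self, trap_type, saved_pc):
        """Enter a trap: raise the trap level and save the interrupted state."""
        if self.tl >= MAXTL:
            raise RuntimeError("cannot nest beyond trap level %d" % MAXTL)
        self.tl += 1
        self.trap_stack.append({"tt": trap_type, "pc": saved_pc})

    def return_from_trap(self):
        """Leave a handler: pop the saved state and lower the trap level."""
        state = self.trap_stack.pop()
        self.tl -= 1
        return state["pc"]

    @property
    def in_error_state(self):
        return self.tl == MAXTL


model = TrapLevelModel()
model.take_trap(0x068, saved_pc=0x1000)  # first trap, taken at TL = 0
model.take_trap(0x031, saved_pc=0x2000)  # nested trap, taken at TL = 1
resume_pc = model.return_from_trap()     # back in the first handler
```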

2.2.3.4 UltraSPARC I & II Trap Table Layout 

The UltraSPARC I & II trap table is halved: the lower half contains trap handlers
for traps taken at trap level 0, and the upper half contains handlers for traps
taken when the trap level is 1 or greater. We implement separate trap handlers for
traps taken at trap levels greater than zero (i.e., we are already handling a trap)
because not all facilities are available when a trap is taken within a trap.

For example, if a trap handler at trap level 0 takes a memory-related trap (such
as a translation miss), the trap handler can assume a higher-level trap handler
will take care of the trap; but a higher-level trap handler cannot always make the
same assumption. Each half of the trap table contains 512 trap handler slots, one
for each trap type shown in Table 2-1.

Each half of the trap table is further divided into two sections: 256 hardware
traps in the lower section, followed by 256 software traps in the upper section
(for the SPARC Tcc software trap instructions). Upon receipt of a trap, the
UltraSPARC I & II processor jumps to the instructions located in the trap table
at the trap table base address (set in the TBA register) plus the offset of the
trap level and trap type. There are 8 instructions (32 bytes) at each slot in
the table, and the upper half begins 512 slots above the base; hence, the trap
handler address is calculated as follows:

TL = 0: trap handler address = TBA + (trap type x 32)

TL > 0: trap handler address = TBA + (512 x 32) + (trap type x 32)
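
The address calculation can be checked with a few lines of Python. This is a
sketch, assuming the upper half of the table begins 512 slots (512 x 32 bytes)
above the base, as the two-half layout with 32-byte slots implies; the function
name and the example TBA value are hypothetical.

```python
SLOT_BYTES = 8 * 4      # 8 instructions of 4 bytes each per trap table slot
SLOTS_PER_HALF = 512    # trap types 0x000 through 0x1FF


def trap_handler_address(tba, trap_level, trap_type):
    """Return the address of the trap table slot for a trap.

    tba        -- trap table base address (value of the TBA register)
    trap_level -- trap level at which the trap is taken (0 = normal execution)
    trap_type  -- trap type from Table 2-1, 0x000..0x1FF
    """
    if not 0 <= trap_type < SLOTS_PER_HALF:
        raise ValueError("trap type out of range")
    # Traps taken at TL > 0 use the upper half of the table.
    half_offset = SLOTS_PER_HALF * SLOT_BYTES if trap_level > 0 else 0
    return tba + half_offset + trap_type * SLOT_BYTES


EXAMPLE_TBA = 0x10000000  # hypothetical base address
clean_window_tl0 = trap_handler_address(EXAMPLE_TBA, 0, 0x024)
clean_window_tl1 = trap_handler_address(EXAMPLE_TBA, 1, 0x024)
```

With this base, the register window clean handler (trap type 024) sits at offset
0x480 in the lower half and at offset 0x4480 in the upper half.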

As a side note, space is reserved in the trap table so that trap handlers for SPARC
register clean, spill, and fill (register window operations) can actually be
longer than 8 instructions. This allows branchless inline handlers to be
implemented such that the entire handler fits within the trap table slot.

Figure 2.3 shows the UltraSPARC I & II trap table layout. 





Trap Level       Trap Table Contents   Trap Types

Trap Level = 0   Hardware Traps        000...07F
                 Spill/Fill Traps      080...0FF
                 Software Traps        100...17F
                 Reserved              180...1FF

Trap Level > 0   Hardware Traps        000...07F
                 Spill/Fill Traps      080...0FF
                 Software Traps        100...17F
                 Reserved              180...1FF



Figure 2.3 UltraSPARC I & II Trap Table Layout 
2.2.3.5 Software Traps 

Software traps are initiated by the SPARC trap instruction, Tcc. The opcode for
the trap instruction includes a 7-bit software trap number, which indexes into the
software portion of the trap table (trap types 100...17F). Software traps are
used primarily for system calls in the Solaris kernel.
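
The indexing can be sketched in Python as follows. This is an illustration
only: it assumes the software trap number is simply added to the first software
trap type (100 hex, per Table 2-1), and the function and constant names are
hypothetical. The two named entries are taken from the trap listings in this
section.

```python
SOFTWARE_TRAP_BASE = 0x100   # first software trap type (Table 2-1: 100...17F)
SOFTWARE_TRAP_COUNT = 0x80   # 128 software trap types in that range

# A few software trap types named elsewhere in this section.
KNOWN_SOFTWARE_TRAPS = {
    0x103: "flush windows",
    0x108: "32-bit system call",
}


def software_trap_type(trap_number):
    """Map a Tcc software trap number to its trap type (trap table index)."""
    if not 0 <= trap_number < SOFTWARE_TRAP_COUNT:
        raise ValueError("software trap number out of range")
    return SOFTWARE_TRAP_BASE + trap_number


syscall32_type = software_trap_type(0x08)
```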

There are three software traps for system calls: one for native system calls, one
for 32-bit system calls (when 32-bit applications are run on a 64-bit kernel), and
one for SunOS 4.x binary compatibility system calls. System calls vector through a
common trap by setting the system call number in a global register and then
issuing a trap instruction. We discuss regular system calls in more detail in
"System Calls" on page 44.

There are also several ultra-fast system calls, each implemented as its own trap.
These system calls pass their simple arguments back and forth via registers.
Because the system calls don't pass arguments on the stack, much less of the
process state needs to be saved during transition into kernel mode, resulting in a
much faster system call implementation. The fast system calls (e.g.,
get_hrestime) are time-related calls.

Table 2-2 lists UltraSPARC software traps, including ultra-fast system calls. 






Table 2-2 UltraSPARC Software Traps 



Trap Definition                               Trap Type Value   Priority

Trap instruction (SunOS 4.x syscalls)         100               16
Trap instruction (user breakpoints)           101               16
Trap instruction (divide by zero)             102               16
Trap instruction (flush windows)              103               16
Trap instruction (clean windows)              104               16
Trap instruction (do unaligned references)    106               16
Trap instruction (32-bit system call)         108               16
Trap instruction (set_trap0)                  109               16
Trap instructions (user traps)                110 - 123         16
Trap instructions (get_hrtime)                124               16
Trap instructions (get_hrvtime)               125               16
Trap instructions (self_xcall)                126               16
Trap instructions (get_hrestime)              127               16
Trap instructions (trace)                     130 - 137         16
Trap instructions (64-bit system call)        140               16

2.2.3.6 A Utility for Trap Analysis

An unbundled tool, trapstat, dynamically monitors trap activity. It is
implemented for the UltraSPARC and Intel x86 processor architectures, on
Solaris 7 and later releases. To install it, unpack the archive and add the
driver with the add_drv command.

Note: trapstat is not supported by Sun. Do not use it on production machines
because it dynamically loads code into the kernel.



# tar xvf trapstat29.tar
...                              Jan 31 03:57 2000 /usr/bin/trapstat
-r-xr-xr-x 0/2   5268 Jan 31 03:...           /usr/bin/sparcv7/trapstat
-rwxrwxr-x 0/1    ... Feb 10 ...              /usr/bin/sparcv9/trapstat
-rwxrwxr-x 0/1  40432 Feb ...                 /usr/kernel/drv/trapstat
-rw-r--r-- 0/1  21224 Sep  8 17:28 ...        /usr/kernel/drv/trapstat.conf
-rw-...    0/1  37128 Sep  8 17:28 1998       /usr/kernel/drv/sparcv9/trapstat
# add_drv trapstat






Once trapstat is installed, use it to analyze the traps taken on each processor 
installed in the system. 



# trapstat 3
vct  name            |    cpu0    cpu1
 24  cleanwin        |    3636    4285
 41  level-1         |      99       1
 45  level-5         |       1       0
 46  level-6         |      60       0
 47  level-7         |      23       0
 4a  level-10        |     100       0
 4d  level-13        |      31      67
 4e  level-14        |     100       0
 60  int-vec         |     161      90
 64  itlb-miss       |    5329   11128
 68  dtlb-miss       |   39130   82077
 6c  dtlb-prot       |       3       2
 84  spill-1-normal  |    1210     992
 8c  spill-3-normal  |     136     286
 98  spill-6-normal  |    5752   20286
 a4  spill-1-other   |     476    1116
 ac  spill-3-other   |    4782    9010
 c4  fill-1-normal   |    1218     752
 cc  fill-3-normal   |    3725    7972
 d8  fill-6-normal   |    5576   20273
103  flush-wins      |      31       0
108  syscall-32      |    2609    3813
124  getts           |    1009    2523
127  gethrtime       |    1004     477
ttl                  |   76401  165150



The example above shows the traps taken on a two-processor Ultra- 
SPARC-II-based system. The first column shows the trap type, followed by an 
ASCII description of the trap type. The remaining columns are the trap counts for 
each processor. 

We can see that much of the trap activity on SPARC consists of register clean,
spill, and fill traps, which perform SPARC register window management. The
level-1 through level-14 and int-vec rows are the interrupt traps. The
itlb-miss, dtlb-miss, and dtlb-prot rows are the UltraSPARC memory management
traps, which occur each time a TLB miss or protection fault occurs. (More on
UltraSPARC memory management in "The UltraSPARC-I and -II HAT" on page 193.) At
the bottom of the output we can see the system call trap for 32-bit system calls
and two special ultra-fast system calls (getts and gethrtime), which each use
their own trap.
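
To compare the categories numerically, the counts can be tallied with a small
Python script. The sketch below uses the cpu1 column of the sample output; the
grouping of rows into categories by name prefix is our own assumption for
illustration and is not something trapstat reports.

```python
# Per-trap counts for cpu1, copied from the trapstat sample output.
CPU1 = {
    "cleanwin": 4285, "level-1": 1, "level-5": 0, "level-6": 0, "level-7": 0,
    "level-10": 0, "level-13": 67, "level-14": 0, "int-vec": 90,
    "itlb-miss": 11128, "dtlb-miss": 82077, "dtlb-prot": 2,
    "spill-1-normal": 992, "spill-3-normal": 286, "spill-6-normal": 20286,
    "spill-1-other": 1116, "spill-3-other": 9010,
    "fill-1-normal": 752, "fill-3-normal": 7972, "fill-6-normal": 20273,
    "flush-wins": 0, "syscall-32": 3813, "getts": 2523, "gethrtime": 477,
}


def category(name):
    """Classify a trapstat row by its name prefix (illustrative grouping)."""
    if name.startswith(("cleanwin", "spill", "fill", "flush")):
        return "register window"
    if name.startswith(("itlb", "dtlb")):
        return "memory management"
    if name.startswith(("level", "int-vec")):
        return "interrupt"
    return "software trap"


def totals_by_category(counts):
    """Sum the trap counts in each category."""
    totals = {}
    for name, count in counts.items():
        key = category(name)
        totals[key] = totals.get(key, 0) + count
    return totals


totals = totals_by_category(CPU1)
grand_total = sum(CPU1.values())
for key in sorted(totals, key=totals.get, reverse=True):
    print("%-18s %7d  %5.1f%%" % (key, totals[key], 100.0 * totals[key] / grand_total))
```

The grand total computed this way matches the ttl row for cpu1 (165150).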

The SPARC V9 Architecture Manual [30] provides a full reference for the
implementation of UltraSPARC traps. We highly recommend this text for specific
implementation details on the SPARC V9 processor architecture.






2.3 Interrupts 



An interrupt is the mechanism that a device uses to signal the kernel that it needs 
attention and some immediate processing is required on behalf of that device. 
Solaris services interrupts by context-switching out the current thread running on 
a processor and executing an interrupt handler for the interrupting device. For 
example, when a packet is received on a network interface, the network controller 
initiates an interrupt to begin processing the packet. 

2.3.1 Interrupt Priorities 

Solaris assigns priorities to interrupts to allow overlapping interrupts to be han- 
dled with the correct precedence; for example, a network interrupt can be config- 
ured to have a higher priority than a disk interrupt. 

The kernel implements 15 interrupt priority levels: level 1 through level 15, 
where level 15 is the highest priority level. On each processor, the kernel can mask 
interrupts below a given priority level by setting the processor's interrupt level. 
Setting the interrupt level blocks all interrupts at the specified level and lower. 
That way, when the processor is executing a level 9 interrupt handler, it does not 
receive interrupts at level 9 or below; it handles only higher-priority interrupts. 
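
The masking rule is easy to state in code. The following Python sketch is a
model of the rule only; the function name and the list of pending levels are
hypothetical.

```python
def deliverable(pending_levels, processor_level):
    """Return the pending interrupt levels the processor will accept.

    An interrupt is accepted only if its level is strictly greater than the
    processor's current interrupt level. Results are ordered highest first.
    """
    accepted = [lvl for lvl in pending_levels if lvl > processor_level]
    return sorted(accepted, reverse=True)


# Processor is executing a level 9 handler; four interrupts are pending.
accepted_now = deliverable([3, 9, 10, 14], processor_level=9)
```

Levels 3 and 9 remain pending until the processor's interrupt level drops;
levels 14 and 10 can be taken.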



[Figure 2.4 diagram: interrupt sources ordered from high to low interrupt
priority level: PIO serial interrupts, clock interrupt, network interrupts, disk
interrupts. Figure note: Interrupts at level 10 or below are handled by interrupt
threads. Clock interrupts are handled by a specific clock interrupt handler
kernel thread. There is one clock interrupt thread systemwide.]

Figure 2.4 Solaris Interrupt Priority Levels 

Interrupts that occur with a priority level at or lower than the processor's inter- 
rupt level are temporarily ignored. An interrupt will not be acknowledged by a pro- 
cessor until the processor's interrupt level is less than the level of the pending 






interrupt. More important interrupts have a higher-priority level to give them a 
better chance to be serviced than lower-priority interrupts. 

Figure 2.4 illustrates interrupt priority levels. 
2.3.1.1 Interrupts as Threads 

Interrupt priority levels can be used to synchronize access to critical sections used 
by interrupt handlers. By raising the interrupt level, a handler can ensure exclu- 
sive access to data structures for the specific processor that has elevated its prior- 
ity level. This is in fact what early, uniprocessor implementations of UNIX systems 
did for synchronization purposes. 

But masking out interrupts to ensure exclusive access is expensive; it blocks 
other interrupt handlers from running for a potentially long time, which could lead 
to data loss if interrupts are lost because of overrun. (An overrun condition is one 
in which the volume of interrupts awaiting service exceeds the system's ability to 
queue the interrupts.) In addition, interrupt handlers using priority levels alone 
cannot block, since a deadlock could occur if they are waiting on a resource held by 
a lower-priority interrupt. 

For these reasons, the Solaris kernel implements most interrupts as asynchro- 
nously created and dispatched high-priority threads. This implementation allows 
the kernel to overcome the scaling limitations imposed by interrupt blocking for 
synchronizing data access and thus provides low-latency interrupt response times. 

Interrupts at priority 10 and below are handled by Solaris threads. These inter- 
rupt handlers can then block if necessary, using regular synchronization primi- 
tives such as mutex locks. Interrupts, however, must be efficient, and it is too 
expensive to create a new thread each time an interrupt is received. For this rea- 
son, each processor maintains a pool of partially initialized interrupt threads, one 
for each of the lower 9 priority levels plus a systemwide thread for the clock inter- 
rupt. When an interrupt is taken, the interrupt uses the interrupt thread's stack, 
and only if it blocks on a synchronization object is the thread completely initial- 
ized. This approach, exemplified in Figure 2.5, allows simple, fast allocation of 
threads at the time of interrupt dispatch. 






[Figure 2.5 diagram. Labels: Executing Thread; Thread is interrupted; Interrupt
thread from CPU thread pool for the priority of the interrupt handles the
interrupt; Thread is resumed; CPU interrupt threads, one for each of levels 9
through 1.]

Figure 2.5 Handling Interrupts with Threads 

Figure 2.5 depicts a typical scenario when an interrupt with priority 9 or less 
occurs (level 10 clock interrupts are handled slightly differently). When an inter- 
rupt occurs, the interrupt level is raised to the level of the interrupt to block subse- 
quent interrupts at this level (and lower levels). The currently executing thread is 
interrupted and pinned to the processor. A thread for the priority level of the inter- 
rupt is taken from the pool of interrupt threads for the processor and is con- 
text-switched in to handle the interrupt. 

The term pinned refers to a mechanism employed by the kernel that avoids con- 
text switching out the interrupted thread. The executing thread is pinned under 
the interrupt thread. The interrupt thread "borrows" the LWP from the executing 
thread. While the interrupt handler is running, the interrupted thread is pinned to 
avoid the overhead of having to completely save its context; it cannot run on any 
processor until the interrupt handler completes or blocks on a synchronization 
object. Once the handler is complete, the original thread is unpinned and resched- 
uled. 

If the interrupt handler thread blocks on a synchronization object (e.g., a mutex 
or condition variable) while handling the interrupt, it is converted into a complete 
kernel thread capable of being scheduled. Control is passed back to the inter- 
rupted thread, and the interrupt thread remains blocked on the synchronization 
object. When the synchronization object is unblocked, the thread becomes runna- 
ble and may preempt lower-priority threads to be rescheduled. 

The processor interrupt level remains at the level of the interrupt, blocking
lower-priority interrupts, even while the interrupt handler thread is blocked. This
prevents lower-priority interrupt threads from interrupting the processing of 
higher-level interrupts. While interrupt threads are blocked, they are pinned to 
the processor they initiated on, guaranteeing that each processor will always have 
an interrupt thread available for incoming interrupts. 
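
The dispatch sequence for low-level interrupts can be sketched as a toy Python
model. It is a simplification under stated assumptions: the class and thread
names are hypothetical, only levels 1 through 9 have pooled threads, and
blocking on a synchronization object and the clock thread are not modeled.

```python
class CpuModel:
    """Toy model of per-CPU interrupt thread dispatch with pinning."""

    def __init__(self):
        self.pil = 0                   # current processor interrupt level
        self.current = "user-thread"   # thread now running on this CPU
        # One pre-created interrupt thread per level 1..9.
        self.pool = {lvl: "intr-thread-%d" % lvl for lvl in range(1, 10)}
        self.pinned = []               # stack of (pinned thread, previous level)

    def take_interrupt(self, level):
        """Dispatch an interrupt; return False if it is masked for now."""
        if level <= self.pil:
            return False               # blocked until the level drops
        if level not in self.pool:
            raise ValueError("level %d is not handled by a pooled thread" % level)
        # Pin the interrupted thread and raise the interrupt level.
        self.pinned.append((self.current, self.pil))
        self.current = self.pool.pop(level)
        self.pil = level
        return True

    def handler_done(self):
        """Finish the handler: return its thread to the pool, unpin the victim."""
        self.pool[self.pil] = self.current
        self.current, self.pil = self.pinned.pop()


cpu = CpuModel()
took_level5 = cpu.take_interrupt(5)   # user thread is pinned under level 5
took_level3 = cpu.take_interrupt(3)   # masked: level 3 is not above level 5
took_level9 = cpu.take_interrupt(9)   # nests on top of the level 5 handler
cpu.handler_done()                    # back to the level 5 handler
after_first_return = (cpu.current, cpu.pil)
cpu.handler_done()                    # back to the interrupted user thread
```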






Level 10 clock interrupts are handled in a similar way, but since there is only 
one source of clock interrupt, there is a single, systemwide clock thread. Clock 
interrupts are discussed further in "The System Clock" on page 54. 

2.3.1.2 Interrupt Thread Priorities

Interrupts that are scheduled as threads share global dispatcher priorities with 
other threads. See Chapter 9, "The Solaris Kernel Dispatcher" for a full descrip- 
tion of the Solaris dispatcher. Interrupt threads use the top ten global dispatcher 
priorities, 160 to 169. Figure 2.6 shows the relationship of the interrupt dis- 
patcher priorities with the real-time, system (kernel) threads and the timeshare 
and interactive class threads. 



[Figure 2.6 diagram. Recoverable labels: +60; 160-169 (interrupt thread
priorities).]




Figure 2.6 Interrupt Thread Global Priorities 
2.3.1.3 High-Priority Interrupts 

Interrupts above priority 10 block out all lower-priority interrupts until they com- 
plete. For this reason, high-priority interrupts need to have an extremely short 
code path to prevent them from affecting the latency of other interrupt handlers 
and the performance and scalability of the system. High-priority interrupt threads 
also cannot block; they can use only the spin variety of synchronization objects. 
This is due to the priority level the dispatcher uses for synchronization. The dis- 
patcher runs at level 10, thus code running at higher interrupt levels cannot enter 
the dispatcher. High-priority threads typically service the minimal requirements of 
the hardware device (the source of the interrupt), then post down a lower-priority 
software interrupt to complete the required processing. 






2.3.1.4 UltraSPARC Interrupts 

On UltraSPARC systems (sun4u), the intr_vector[] array is a single, systemwide
interrupt table for all hardware and software interrupts, as shown in Figure
2.7.



[Figure 2.7 diagram: the intr_vector array; each entry holds iv_handler, iv_arg,
iv_pil, iv_pending, and iv_mutex (Solaris 2.5.1 & 2.6 only).]



Figure 2.7 Interrupt Table on sun4u Architectures 

Interrupts are added to the array through an add_ivintr() function. (Other
platforms have a similar function for registering interrupts.) Each interrupt
registered with the kernel has a unique interrupt number that locates the handler
information in the interrupt table when the interrupt is delivered. The interrupt
number is passed as an argument to add_ivintr(), along with a function pointer
(the interrupt handler, iv_handler), an argument list for the handler (iv_arg),
and the priority level of the interrupt (iv_pil).

Solaris 2.5.1 and Solaris 2.6 allow for unsafe device drivers — drivers that have 
not been made multiprocessor safe through the use of locking primitives. For 
unsafe drivers, a mutex lock locks the interrupt entry to prevent multiple threads 
from entering the driver's interrupt handler. 

Solaris 7 requires that all drivers be minimally MP safe, dropping the
requirement for a lock on the interrupt table entry. The iv_pending field is used
as part of the queueing process; generated interrupts are placed on a
per-processor list of interrupts waiting to be processed. The pending field is set
until a processor prepares to field the interrupt, at which point the pending
field is cleared.

A kernel add_softintr() function adds software-generated interrupts to the
table. The process is the same for both functions: use the interrupt number passed
as an argument as an index to the intr_vector[] array, and add the entry. The
size of the array is large enough that running out of array slots is unlikely.
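
The table and its registration step can be pictured with a small Python model.
This is a sketch, not the kernel's C code: the Python add_ivintr() below only
borrows the kernel function's name, its signature is illustrative, and
MAX_INTERRUPTS, post_interrupt(), and service_interrupt() are hypothetical
names introduced for the example. The field names follow Figure 2.7.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

MAX_INTERRUPTS = 64  # hypothetical table size, chosen for the sketch


@dataclass
class IntrVector:
    iv_handler: Callable[[Any], Any]  # interrupt handler function
    iv_arg: Any                       # argument passed to the handler
    iv_pil: int                       # priority level of the interrupt
    iv_pending: bool = False          # set while the interrupt awaits service


intr_vector: List[Optional[IntrVector]] = [None] * MAX_INTERRUPTS


def add_ivintr(inum, handler, arg, pil):
    """Register a handler, using the interrupt number as the array index."""
    if not 0 <= inum < MAX_INTERRUPTS:
        raise ValueError("interrupt number out of range")
    intr_vector[inum] = IntrVector(handler, arg, pil)


def post_interrupt(inum):
    """Mark an interrupt as generated and waiting to be processed."""
    intr_vector[inum].iv_pending = True


def service_interrupt(inum):
    """Clear the pending flag, then call the registered handler."""
    entry = intr_vector[inum]
    entry.iv_pending = False
    return entry.iv_handler(entry.iv_arg)


add_ivintr(5, lambda arg: "handled " + arg, "disk0", 4)
post_interrupt(5)
was_pending = intr_vector[5].iv_pending
result = service_interrupt(5)
```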



2.3.2 Interrupt Monitoring 

You can use the mpstat(1M) and vmstat(1M) commands to monitor interrupt
activity on a Solaris system. mpstat(1M) provides interrupts-per-second for each
CPU in the intr column, and interrupts handled on an interrupt thread (low-level
interrupts) in the ithr column.



# mpstat 3
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys  wt idl
  0    5   0    7   39   12  250   17    9   18   0   725   4   2   0  94
  1    4   0   10  278   83  275   40    9   40   0  941   4   2   0  93
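
If you want to post-process such output, a few lines of Python are enough. The
sketch below parses the sample shown above; it assumes whitespace-separated
columns with a single header line, and it reads the syscl value for CPU 0 as
725. The function name is hypothetical.

```python
SAMPLE = """\
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 5 0 7 39 12 250 17 9 18 0 725 4 2 0 94
1 4 0 10 278 83 275 40 9 40 0 941 4 2 0 93
"""


def parse_mpstat(text):
    """Turn mpstat output into a list of {column name: integer} rows."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (int(value) for value in row))) for row in rows]


rows = parse_mpstat(SAMPLE)
for row in rows:
    # intr counts all interrupts; ithr counts those handled as threads.
    print("CPU %d: intr=%d ithr=%d" % (row["CPU"], row["intr"], row["ithr"]))
busiest = max(rows, key=lambda row: row["intr"])
```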



2.3.3 Interprocessor Interrupts and Cross-Calls 

The kernel can send an interrupt or trap to another processor when it requires
another processor to do some immediate work on its behalf. Interprocessor
interrupts are delivered through the poke_cpu() function; they are used for the
following purposes:

• Preempting the dispatcher — A thread may need to signal a thread run- 
ning on another processor to enter kernel mode when a preemption is 
required (initiated by a clock or timer event) or when a synchronization object 
is released. Chapter 9, "The Dispatcher," further discusses preemption. 

• Delivering a signal — The delivery of a signal may require interrupting a 
thread on another processor. 

• Starting/stopping /proc threads — The /proc infrastructure uses inter- 
processor interrupts to start and stop threads on different processors. 

Using a similar mechanism, the kernel can also instruct a processor to execute a 
specific low-level function by issuing a processor-to-processor cross-call. Cross-calls 
are typically part of the processor-dependent implementation. UltraSPARC ker- 
nels use cross-calls for two purposes: 

• Implementing interprocessor interrupts — As discussed above. 

• Maintaining virtual memory translation consistency — Implementing 
cache consistency on SMP platforms requires the translation entries to be 
removed from the MMU of each CPU that a thread has run on when a vir- 
tual address is unmapped. On UltraSPARC, user processes issuing an unmap 
operation make a cross-call to each CPU on which the thread has run, to 
remove the TLB entries from each processor's MMU. Address space unmap 
operations within the kernel address space make a cross-call to all proces- 
sors for each unmap operation. 
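
The targeting rule described above can be sketched in C. This is a minimal illustration under stated assumptions, not kernel code: the cpuset_t bitmask type, the as_sketch structure, and the functions note_ran_on(), unmap_targets(), and count_targets() are hypothetical names invented for this example, and the sketch assumes at most 64 CPUs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpuset_t;

#define NCPU_SKETCH 64

/* Hypothetical record of the CPUs on which a process's threads have run. */
struct as_sketch {
    int      is_kernel;   /* nonzero for the kernel address space */
    cpuset_t ran_on;      /* CPUs that may hold TLB entries for this space */
};

/* Record that a thread using this address space ran on the given CPU. */
void
note_ran_on(struct as_sketch *as, int cpu)
{
    assert(cpu >= 0 && cpu < NCPU_SKETCH);
    as->ran_on |= (cpuset_t)1 << cpu;
}

/*
 * Choose the CPUs that must receive a cross-call for an unmap.
 * online is the set of CPUs currently online.
 */
cpuset_t
unmap_targets(const struct as_sketch *as, cpuset_t online)
{
    if (as->is_kernel)
        return online;              /* kernel unmap: every processor */
    return as->ran_on & online;     /* user unmap: only CPUs it ran on */
}

/* Count the cross-calls that one unmap would generate. */
int
count_targets(cpuset_t set)
{
    int n = 0;

    while (set != 0) {
        set &= set - 1;             /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With four CPUs online and a process that has run only on CPUs 1 and 3, a user unmap in this sketch targets two processors, whereas a kernel address space unmap targets all four. That difference is why kernel unmap activity tends to dominate the cross-call count.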

Both cross-calls and interprocessor interrupts are reported by mpstat(1M) in the
xcal column as cross-calls per second.



# mpstat 3
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    6  607  246 1100  174   82   84   0  2907  28   5  0  66
  1    0   0    2  218    0 1037  212   83   80   0  3438  33   4  0  62






High numbers of reported cross-calls can result from either of the activities
mentioned in the preceding section. Most commonly, they come from kernel address
space unmap activity caused by file system activity.



2.4 System Calls 



Recall from "Access to Kernel Services" on page 27 that system calls are
interfaces callable by user programs in order to have the kernel perform a
specific function (e.g., opening a file) on behalf of the calling thread. System
calls are part of the application programming interfaces (APIs) that ship with
the operating system; they are documented in Section 2 of the manual pages. The
invocation of a system call causes the processor to change from user mode to
kernel mode. This change is accomplished on SPARC systems by means of the trap
mechanism previously discussed.

2.4.1 Regular System Calls 

System calls are referenced in the system through the kernel sysent table, which 
contains an entry for every system call supported on the system. The sysent table 
is an array populated with sysent structures, each structure representing one 
system call, as illustrated in Figure 2.8. 



[Figure: the sysent array, with one entry expanded to show its fields: sy_narg,
sy_flags, sy_call(), sy_lock, and sy_callc()]

Figure 2.8 The Kernel System Call Entry (sysent) Table 

The array is indexed by the system call number, which is established in the 
/etc/name_to_sysnum file. Using an editable system file provides for adding sys- 
tem calls to Solaris without requiring kernel source and a complete kernel build. 
Many system calls are implemented as dynamically loadable modules that are 
loaded into the system when the system call is invoked for the first time. Loadable 
system calls are stored in the /kernel/sys and /usr/kernel/sys directories. 

The system call entry in the table provides the number of arguments the system
call takes (sy_narg), a flag field (sy_flags), and a reader/writer lock
(sy_lock) for loadable system calls. The system call itself is referenced through a
function pointer: sy_call or sy_callc.
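
The table layout and the lookup by system call number can be sketched as follows. This is a simplified stand-in, not the kernel's definition: the field names follow the text, but the field types, the placeholder for the lock, the two made-up system calls (sk_add and sk_max3), the table contents, and dispatch_sketch() are all assumptions introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for a sysent entry. Field names follow the text;
 * the types and the lock placeholder are assumptions.
 */
struct sysent_sketch {
    int   sy_narg;                        /* number of arguments */
    int   sy_flags;                       /* flag field */
    int   (*sy_call)(void *args);         /* older-style entry point */
    void *sy_lock;                        /* placeholder for the rwlock */
    long  (*sy_callc)(long, long, long);  /* C-style entry point */
};

/* Two made-up system calls for the example. */
static long
sk_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

static long
sk_max3(long a, long b, long c)
{
    long m = a > b ? a : b;

    return m > c ? m : c;
}

/* Hypothetical table; the index plays the role of the system call number. */
static const struct sysent_sketch sysent_sketch_tab[] = {
    { 0, 0, NULL, NULL, NULL },      /* 0: unused slot */
    { 2, 0, NULL, NULL, sk_add },    /* 1: sk_add */
    { 3, 0, NULL, NULL, sk_max3 },   /* 2: sk_max3 */
};

#define NSYSENT_SKETCH \
    (sizeof (sysent_sketch_tab) / sizeof (sysent_sketch_tab[0]))

/* Look up an entry by number and call it; returns -1 for a bad number. */
long
dispatch_sketch(size_t sysnum, long a0, long a1, long a2)
{
    if (sysnum >= NSYSENT_SKETCH)
        return -1;
    const struct sysent_sketch *se = &sysent_sketch_tab[sysnum];
    if (se->sy_callc == NULL)
        return -1;
    assert(se->sy_narg <= 3);
    return se->sy_callc(a0, a1, a2);
}
```

For example, dispatch_sketch(1, 2, 3, 0) reaches sk_add through slot 1. Using -1 as the error return is a shortcut for the sketch only, since a real call could legitimately return that value.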



Historical Aside: The fact that there are two entries for the system call
functions is the result of a rewriting of the system call argument-passing
implementation, an effort that first appeared in Solaris 2.4. Earlier Solaris
versions passed system call arguments in the traditional UNIX way: bundling the
arguments into a structure and passing the structure pointer (uap is the
historical name in UNIX implementations and texts; it refers to a user argument
pointer). Most of the system calls in Solaris have been rewritten to use the C
language argument-passing convention implemented for function calls. Using that
convention provided better overall system call performance because the code can
take advantage of the argument-passing features inherent in the register window
implementation of SPARC processors (using the in registers for argument passing;
refer to [31] for a description of SPARC register windows).



sy_call represents an entry for system calls and uses the older uap pointer
convention, maintained here for binary compatibility with older versions of
Solaris. sy_callc is the function pointer for the more recent argument-passing
implementation. The newer C style argument passing has shown significant overall
performance improvements in system call execution, on the order of 30 percent in
some cases.
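
The difference between the two conventions can be illustrated with a small C sketch. The function and structure names here (add3_args, add3_uap, add3_c) are hypothetical and stand in for a real system call handler; only the calling style is the point.

```c
#include <assert.h>

/* Older convention: arguments bundled in a structure, passed by pointer. */
struct add3_args {
    long a;
    long b;
    long c;
};

long
add3_uap(struct add3_args *uap)
{
    assert(uap != 0);
    /* Each argument is fetched from memory through the uap pointer. */
    return uap->a + uap->b + uap->c;
}

/* Newer convention: ordinary C arguments. */
long
add3_c(long a, long b, long c)
{
    /* The arguments arrive the way any C function receives them. */
    return a + b + c;
}
```

Both functions compute the same result. In the uap style, the caller must first store the arguments into the structure and the handler must load them back, while the C style lets the arguments travel the way ordinary function arguments do, which on SPARC means in registers.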



[Figure: in user mode, main() calls open() on a file and, if the open succeeds,
read(). The open() call issues a system call trap into the kernel and enters the
system call trap handler, which performs the following steps:]

    Save CPU structure pointer
    Save return address
    Increment CPU sysinfo syscall count
    Save args in LWP
    if (t_pre_sys)
        do syscall preprocessing
    Load syscall number in t_sysnum
    Invoke syscall (open())
    Any signals posted?
    if (t_post_sys)
        do postprocessing
    Set return value from syscall
    return



Figure 2.9 System Call Execution 






The execution of a system call results in the software issuing a trap instruction, 
which is how the kernel is entered to process the system call. The trap handler for 
the system call is entered, any necessary preprocessing is done, and the system 
call is executed on behalf of the calling thread. The flow is illustrated in Figure 2.9. 

When the trap handler is entered, the trap code saves a pointer to the CPU 
structure of the CPU on which the system call will execute, saves the return 
address, and increments a system call counter maintained on a per-CPU basis. The 
number of system calls per second is reported by mpstat(1M) (syscl column) for
per-CPU data; systemwide, the number is reported by vmstat(1M) (sy column).

Two flags in the kernel thread structure indicate that pre-system call or
post-system call processing is required. The t_pre_sys flag (preprocessing) is set
for things like truss(1) command support (system call execution is being traced)
or microstate accounting being enabled. Post-system call work (t_post_sys) may
be the result of /proc process tracing, profiling enabled for the process, a pending
signal, or an exigent preemption. In the interim between pre- and postprocessing,
the system call itself is executed.
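
The sequence just described can be sketched as a small user-level simulation. It is not the kernel's trap handler: the cpu_sketch and thread_sketch structures, the pre_done and post_done bookkeeping fields, the syscall_fn type, and the twice_sketch() call are assumptions made for this example. Only the flag names t_pre_sys and t_post_sys and the per-CPU counter come from the text.

```c
#include <assert.h>

struct cpu_sketch {
    unsigned long syscall_count;   /* per-CPU system call counter */
};

struct thread_sketch {
    int t_pre_sys;     /* nonzero: pre-system call work is required */
    int t_post_sys;    /* nonzero: post-system call work is required */
    int t_sysnum;      /* number of the system call in progress */
    int pre_done;      /* example bookkeeping: times preprocessing ran */
    int post_done;     /* example bookkeeping: times postprocessing ran */
};

typedef long (*syscall_fn)(long arg);

/* Walk through the handler steps for one simulated system call. */
long
syscall_trap_sketch(struct cpu_sketch *cpu, struct thread_sketch *t,
    int sysnum, syscall_fn fn, long arg)
{
    long rval;

    assert(cpu != 0 && t != 0 && fn != 0);
    cpu->syscall_count++;          /* feeds the per-CPU syscall statistic */
    if (t->t_pre_sys)
        t->pre_done++;             /* tracing hooks would run here */
    t->t_sysnum = sysnum;          /* record which call is executing */
    rval = fn(arg);                /* the system call itself */
    if (t->t_post_sys)
        t->post_done++;            /* signal handling would run here */
    return rval;
}

/* Made-up system call used only to exercise the sketch. */
long
twice_sketch(long x)
{
    return 2 * x;
}
```

Testing a flag before doing any extra work keeps the common path short: a thread that is not being traced and has no pending signal pays only for the two flag checks.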



2.4.2 Fast Trap System Calls 

The overhead of the system call framework is nontrivial; that is, there is some 
inherent latency with all system calls because of the system call setup process we 
just discussed. In some cases, we want to be able to have fast, low-latency access to 
information, such as high-resolution time, that can only be obtained in kernel 
mode. The Solaris kernel provides a fast system call framework so that user pro- 
cesses can jump into protected kernel mode to do minimal processing and then 
return, without the overhead of the full system call framework. This framework 
can only be used when the processing required in the kernel does not significantly 
interfere with registers and stacks. Hence, the fast system call does not need to 
save all the state that a regular system call does before it executes the required 
functions. 

Only a few fast system calls are implemented in Solaris versions up to Solaris 7: 
gethrtime(), gethrvtime(), and gettimeofday(). These functions return
time of day and processor CPU time. They simply trap into the kernel to read a 
single hardware register or memory location and then return to user mode. 
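
One way to get a feel for the cost of such calls is a simple timing loop. The sketch below is an assumption-laden illustration rather than the measurement method used in this chapter: it uses the POSIX clock_gettime() call with CLOCK_MONOTONIC as a portable stand-in for gethrtime(), and the helper names now_ns(), avg_getpid_ns(), and avg_clock_ns() are invented for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <time.h>
#include <unistd.h>

/* Read a monotonic clock in nanoseconds (stand-in for gethrtime()). */
static long long
now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Average cost, in nanoseconds, of one getpid() call over iters calls. */
double
avg_getpid_ns(long iters)
{
    volatile pid_t sink = 0;
    long long start, end;
    long i;

    assert(iters > 0);
    start = now_ns();
    for (i = 0; i < iters; i++)
        sink = getpid();
    end = now_ns();
    (void)sink;
    return (double)(end - start) / (double)iters;
}

/* Average cost, in nanoseconds, of one clock read over iters reads. */
double
avg_clock_ns(long iters)
{
    volatile long long sink = 0;
    long long start, end;
    long i;

    assert(iters > 0);
    start = now_ns();
    for (i = 0; i < iters; i++)
        sink = now_ns();
    end = now_ns();
    (void)sink;
    return (double)(end - start) / (double)iters;
}
```

Calling avg_getpid_ns(1000000) and avg_clock_ns(1000000) and comparing the two averages gives a rough picture on a given machine. The numbers depend heavily on the processor, the operating system release, and the C library, so they should be treated as indicative only.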

Table 2-3 compares the average latency for the getpid()/time() system calls
and two fast system calls. For reference, the latency of a standard function call is 
also shown. Times were measured on a 300 MHz Ultra2. Note that the latency of 
the fast system calls is about five times lower than that of an equivalent regular 
system call. 



